Multi-phase attention network for face super-resolution

Previous general super-resolution methods do not perform well in restoring the detailed structure information of face images. Prior- and attribute-based face super-resolution methods improve performance by exploiting extra information, but they need an additional network, and the extra training data are challenging to obtain. To address these issues, we propose a Multi-phase Attention Network (MPAN). Specifically, our proposed MPAN is built on integrated residual attention groups (IRAG) and a concatenated attention module (CAM). Each IRAG consists of residual channel attention blocks (RCAB) and an integrated attention module (IAM). Meanwhile, we use the IRAGs to bootstrap the face structures. We utilize the CAM to concentrate on informative layers, hence improving the network's ability to reconstruct facial texture features. We use the IAM to focus on important positions and channels, which makes the network more effective at restoring key face structures such as eyes and mouths. These two attention modules form the multi-phase attention mechanism. Extensive experiments show that our MPAN has a significant competitive advantage over other state-of-the-art networks on various scale factors under various metrics, including PSNR and SSIM. Overall, our proposed multi-phase attention mechanism significantly improves the network's ability to recover HR face images without using additional information.


Introduction
Face super-resolution, a crucial branch of image super-resolution, is the procedure of recovering a high-resolution face image from an input low-resolution face image. Since face super-resolution is a vital image restoration task and is broadly used in many situations, such as faces in surveillance videos and identity recognition, it has received increasing attention and research focus.
Recently, general super-resolution methods, including basic CNN-based methods, GAN-based methods, and so on, have progressed rapidly. For example, Li Z et al. proposed a feedback method, SRFBN [1], which utilizes a recurrent neural network (RNN) to form a feedback mechanism for recovering HR face images. Zhang M et al. proposed a pixel-wise GAN named SPGAN [2], which uses a discriminative matrix and a supervised pixel-wise adversarial loss to restore realistic face images. However, these general super-resolution methods have difficulty recovering key face structures such as eyes and mouths. These structures make up only a small part of the face, but they require more attention from the network to recover. They are also often more challenging to recover than other parts because they contain large pixel changes. On the other hand, previous face super-resolution research mainly focuses on using additional information, such as prior and attribute information [3,4], to reconstruct the high-resolution image. For example, Ma C et al. proposed a recursive cooperation method, DIC [3], which utilizes the prior knowledge of landmark estimation to recover the face image and uses a new attentive fusion module to improve the effect of landmark maps. Yu X et al. proposed an attribute-embedded method, EFSRSA [4], which incorporates facial attribute vectors into an autoencoder and utilizes deconvolutional layers to upsample the feature maps. However, two main drawbacks exist: (1) more effort and computing resources are needed to obtain the additional information; (2) effective additional information is challenging to extract from the low-resolution image.
Moreover, if the face image is separated into many small parts and each part is regarded as an individual sample, we need to balance the interrelationship between the parts that include key face structures and those that do not, while retaining the informative features. We also need to extract features from layers of different depths to recover face texture details. This motivates us to propose a new network that solves the problems mentioned above.
Therefore, this paper introduces a Multi-phase Attention Network (MPAN) for face super-resolution, which is constructed by stacking integrated residual attention groups (IRAG). Each IRAG is composed of residual channel attention blocks (RCAB) [5] and an Integrated Attention Module (IAM). Each IRAG's IAM comprises a channel attention module and a spatial attention module. The channel attention module allocates different channel weights to extract critical information. The spatial attention module assigns a distinct weight to each position of the feature map. The IAM makes the network place greater emphasis on important components such as the eyes and mouths. Our MPAN also applies a concatenated attention module (CAM) to weight feature layers of different depths. The concatenated attention module makes the network focus more on the informative layers, not merely the deeper layers of the network, making the network more efficient in recovering face texture details. The main contributions of this paper are as follows: 1. We propose a multi-phase attention network (MPAN) that does not rely on prior or attribute information. Our MPAN has an advantage over other face super-resolution methods in recovering HR face images, and it achieves better performance than other networks on various scale factors under metrics including PSNR and SSIM [6,7].

Face super-resolution methods
There are five categories of face super-resolution methods [8]. General face super-resolution methods concentrate on effective networks with various advanced structures, such as residual blocks [9][10][11] and attention mechanisms [12], to enhance effectiveness. General face super-resolution methods can be further subdivided into four distinct groups: basic CNN-based methods, GAN-based methods, reinforcement learning-based methods, and ensemble learning-based methods. Prior-based face super-resolution methods first extract prior facial information, such as facial heatmaps [13], facial landmarks [14], and facial parsing maps, and then utilize it to reconstruct a clearer facial structure. Identity-preserving face super-resolution methods take full advantage of the identity information of face images to maintain identity consistency. Attribute-constrained face super-resolution methods fully exploit facial semantic knowledge, for example, a description given by a witness. Reference face super-resolution methods utilize several high-resolution face images of the same person as the low-resolution face image. These high-resolution images may be used to restore facial photographs by providing identity information, but it is sometimes hard to find an available reference image.
In recent decades, various face super-resolution methods have been presented. A pioneering CNN-based method for super-resolution was EDSR [15], which can also be applied to face super-resolution. EDSR removes the unnecessary batch normalization modules from the residual blocks to improve performance. Based on EDSR, Haris M et al. first proposed DBPN [16], which uses deep back-projection to enhance the interdependence between low- and high-layer image features. Zhang Y et al. first introduced RDN [17], which uses dense connections to learn more effective features from preceding parts. The three methods mentioned above mainly focus on residual blocks and skip connections. Attention mechanisms are also applied in face super-resolution. Chen Y et al. proposed a face image super-resolution method [18] that applies a channel attention mechanism to feature maps. This method extracts features from LR images, reallocates channel features, and recovers HR images at various scales. The latest methods restore the HR image from the LR image based on extra facial information, such as facial landmarks, facial heatmaps, and facial parsing maps. Chen C et al. proposed a progressive semantic-aware style transformation method, PSFR-GAN [19], which takes advantage of the parsing maps and pixel-space features of LR face images. This method also has a semantic-aware loss function that computes a semantic region loss to better recover the key face structures. However, the methods mentioned above either lose key face structure details in intermediate feature layers [20] because of their very deep depth, or generate artifacts and unreal details. We propose a multi-phase attention network that contains the concatenated attention module and the integrated attention module to reallocate the features across layers, channels, and positions.

Attention network
Various networks that use attention mechanisms have been introduced to solve vision tasks. The key point of the attention mechanism is to utilize an attention map to reweight features in the network [21]. Hu J et al. first proposed the squeeze-and-excitation network (SENet) [22], which introduced attention mechanisms and modularity earlier than other networks. Wang Q et al. proposed the Efficient Channel Attention network for deep convolutional neural networks (ECA-Net) [23], which uses a convolution operation to generate the channel attention map. Woo S et al. proposed the convolutional block attention module (CBAM) [24], which applies spatial attention and channel attention in sequence. Park J et al. proposed the Bottleneck Attention Module (BAM) [25], which consists of spatial attention and channel attention placed in two separate paths. Muqeet A et al. proposed the Multi-Attentive Feature Fusion Super-Resolution Network (MAFFSRN) [26], which is constructed by stacking multi-attention blocks; the multi-attention block integrates and enhances channel and spatial attention. Zhang Y et al. proposed the residual channel attention network (RCAN) [5], which applies not only a residual structure with long skip connections but also channel attention to allocate weights to different channels. Dai T et al. proposed a second-order attention network (SAN) [27], which combines long-distance interdependences with the entire network structure information and obtains remarkable performance. Wang C et al. proposed a two-step face super-resolution network (FishSRNet) [28], which utilizes face prior knowledge and uses multi-scale channel and spatial attention mechanisms. Zhao H et al. proposed a lightweight and effective network (PAN) [29], which is built on a novel pixel attention mechanism. Pixel attention works in the same way as channel attention and spatial attention, but it applies 3D attention maps and has fewer parameters. However, these algorithms only focus on channel and spatial interdependences and neglect the correlations between layers of different depths. As a result, we propose the multi-phase attention network to fully utilize the feature interrelations between hierarchical layers.

Ethics statement
The individuals pictured in Figs 1, 5-9 have provided written informed consent (as outlined in PLOS consent form) to publish their image alongside the manuscript.

Network architecture
As illustrated in Fig 1, our proposed MPAN is primarily composed of four major components: shallow feature extraction, attention in residual (AIR), the concatenated attention module (CAM), and the final reconstruction part.
Shallow feature extraction. We refer to $I_{LR}$ and $I_{SR}$ as the input and output of MPAN. As investigated in [15,30], we extract the shallow feature $F_0$ from the LR input using a convolutional layer:

$$F_0 = H_{SE}(I_{LR}),$$

where $H_{SE}$ represents the convolution operation. Then $F_0$ is utilized as the input of the backbone attention in residual (AIR).
Attention in residual (AIR). The details of our AIR structure are shown in Fig 1. The AIR structure consists of G integrated residual attention groups (IRAG) and G skip connections (SC). Each IRAG further consists of B residual channel attention blocks (RCAB) [5] and an Integrated Attention Module (IAM). Simply stacking many IRAGs may lead to bad performance, so the skip connections (SC) are used to stabilize the training of the deep network [31]. The function of the first IRAG in the AIR structure can be represented as

$$F_1 = H_1(F_0),$$

and the other IRAGs in the AIR structure can be represented as

$$F_i = H_i(F_{i-1} + F_0), \quad i = 2, \ldots, G,$$

where $F_i$ denotes the i-th IRAG's output, G is the number of IRAGs, and $H_i$ is the function of the i-th IRAG. The output of the AIR structure is formulated as

$$F_{DF} = W_{CF} * F_G,$$

where $F_G$ is the output of the last IRAG, $F_{DF}$ is the output of the AIR structure, $W_{CF}$ is the weight of the Conv layer at the AIR's tail, and $*$ denotes convolution.
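To make the AIR data flow concrete, the following PyTorch sketch mirrors the equations above. It is a minimal illustration, not our exact implementation: the RCAB and IAM modules are assumed to be defined elsewhere (see the Integrated attention module section), and the channel width, B = 20, and G = 10 simply reflect the values selected in our ablation studies.

```python
import torch.nn as nn

# A minimal sketch of the AIR backbone. RCAB and IAM are assumed to be
# defined elsewhere; the 3x3 tail convolution plays the role of W_CF.
class IRAG(nn.Module):
    def __init__(self, channels, num_rcab=20):
        super().__init__()
        self.rcabs = nn.Sequential(*[RCAB(channels) for _ in range(num_rcab)])
        self.iam = IAM(channels)

    def forward(self, x):
        return self.iam(self.rcabs(x))      # F_i = IAM(H_RCAB(input))

class AIR(nn.Module):
    def __init__(self, channels=64, num_irag=10):
        super().__init__()
        self.groups = nn.ModuleList([IRAG(channels) for _ in range(num_irag)])
        self.tail = nn.Conv2d(channels, channels, 3, padding=1)  # W_CF

    def forward(self, f0):
        f = self.groups[0](f0)              # F_1 = H_1(F_0)
        feats = [f]
        for irag in self.groups[1:]:
            f = irag(f + f0)                # F_i = H_i(F_{i-1} + F_0)
            feats.append(f)
        return self.tail(f), feats          # F_DF and hierarchical F_i for CAM
```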
Concatenated attention module (CAM). The previous AIR structure extracts the hierarchical features $F_i$. To model the feature correlations between layers, we further propose a concatenated attention module (see the Concatenated attention module section for details) that weights layers of different depths.
The proposed concatenated attention fully utilizes features from all previous layers and is formulated as

$$F_L = H_{CA}([F_1, F_2, \ldots, F_G]),$$

where $H_{CA}$ denotes the CAM, $[\cdot]$ denotes concatenation, and $F_L$ is the CAM output. The CAM rescales the fused intermediate features $F_i$, which are generated as the output features of the IRAGs. Consequently, high-contribution feature layers are enhanced, while redundant feature layers are suppressed. Image reconstruction. After processing the features in the previous structures, we perform an element-wise summation to combine the output features. To upscale, we then apply sub-pixel convolution, which acts as the upsampling module. The sub-pixel convolution utilizes an array of upscaling filters to convert LR feature maps to HR output, reducing the computational complexity of the upscale operation. A convolution operation is then applied to the upscaled features. The whole process can be represented as

$$I_{SR} = H_{REC}(F_0 + F_L + F_{DF}),$$

where $H_{REC}$ represents the reconstruction operation, which includes the sub-pixel convolution and the convolution operation, and $I_{SR}$ is the SR output.
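As an illustration, a minimal PyTorch sketch of the reconstruction head $H_{REC}$ is given below; the feature width of 64 and the 3x3 kernels are assumptions for illustration, not values fixed by the architecture description. PixelShuffle is PyTorch's built-in sub-pixel convolution rearrangement.

```python
import torch.nn as nn

# A minimal sketch of the sub-pixel reconstruction head H_REC. The input is
# the element-wise sum F_0 + F_L + F_DF; the channel width and kernel sizes
# are illustrative assumptions.
def make_reconstruction(channels=64, scale=4, out_channels=3):
    return nn.Sequential(
        nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
        nn.PixelShuffle(scale),   # rearranges (C*s^2, H, W) -> (C, s*H, s*W)
        nn.Conv2d(channels, out_channels, 3, padding=1))

# Usage: i_sr = make_reconstruction(scale=4)(f0 + f_l + f_df)
```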

Concatenated attention module
Although dense skip connections are used to learn more effective features from shallow layers and stabilize the training of the deep network [10], the interdependences between layers are not fully utilized. Thus, we propose a novel CAM that learns the interrelationship of layer features at various depths to enhance the representation power of the network. Specifically, the CAM reshapes the feature maps of the feature groups into two matrices and constructs interdependencies between different feature layers. The CAM allocates a distinctive attention weight to each layer's features, improving the feature representation capability. Fig 2 depicts the structure of the CAM. The input consists of discrete feature groups taken from the G integrated residual attention groups (IRAG). We first apply a concatenation operation; the dimension of the concatenation output is G × H × W × C. Then, we reshape the feature groups into two 2D matrices, one of dimension G × HWC and the other of dimension HWC × G. We perform matrix multiplication on these two matrices to calculate the interdependencies of different feature layers:

$$w_{ij} = \varphi\big(\theta(FG)_{i} \cdot \theta(FG)^{T}_{j}\big),$$

where $\varphi(\cdot)$ and $\theta(\cdot)$ represent the softmax and reshape operations, $i$ denotes the i-th row of the reshaped feature groups matrix, $j$ denotes the j-th column of the reshaped, transposed feature groups matrix, and $w_{ij}$ denotes the corresponding coefficient between the i-th and j-th extracted feature layers. Next, we multiply the calculated coefficient matrix with the first reshaped feature groups matrix using matrix multiplication:

$$F_{ij} = \sum_{k=1}^{G} w_{ik} \, FG_{kj},$$

where $FG_{kj}$ denotes the element in the k-th row and j-th column of the feature groups matrix, and $F_{ij}$ denotes the element in the i-th row and j-th column of the output of the matrix multiplication. At last, we reshape the output of the matrix multiplication and add the input feature groups. Overall, the proposed CAM structure makes the network concentrate on the more informative intermediate layer features.
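The following PyTorch sketch illustrates the CAM computation above, assuming the G IRAG outputs have already been stacked into a single tensor; the learnable residual scale `gamma` is an implementation assumption, not a detail stated in the text.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """A minimal sketch of the concatenated attention module.

    feats holds the stacked IRAG outputs, shape (N, G, C, H, W). The
    learnable scale `gamma` on the attention residual is an assumption.
    """
    def __init__(self):
        super().__init__()
        self.softmax = nn.Softmax(dim=-1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, feats):                  # (N, G, C, H, W)
        n, g, c, h, w = feats.shape
        fg = feats.view(n, g, -1)              # reshape theta(FG): (N, G, HWC)
        # w_ij: correlation between the i-th and j-th feature layers.
        corr = self.softmax(torch.bmm(fg, fg.transpose(1, 2)))  # (N, G, G)
        out = torch.bmm(corr, fg).view(n, g, c, h, w)           # reweight layers
        return self.gamma * out + feats        # add back the input feature groups
```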

Integrated attention module
The existing channel attention networks [22,23] construct a weight correlation matrix and allocate different weights to channels, with little consideration of spatial features. On the other hand, spatial attention networks [24,25] neglect the distinctive weights of channels. As a result, we propose an integrated attention module (IAM) that fuses the channel and spatial attention modules to increase accuracy.
Every IRAG consists of B residual channel attention blocks (RCAB) [5] and an Integrated Attention Module (IAM). The input of the first IRAG is $F_0$, the output features of the RCABs in the first IRAG are $H_{RCAB}(F_0)$, and the output of the first IRAG is $F_1$. Here we mainly focus on the case $i \geq 2$, so the input of the i-th IRAG is $F_{i-1} + F_0$, and the output features of the RCABs in the i-th IRAG are $H_{RCAB}(F_{i-1} + F_0)$, where $H_{RCAB}$ represents the function of the B residual channel attention blocks (RCAB). We apply the IAM to the output features of the RCABs, as shown in Fig 1. The structure of our proposed IAM is shown in Fig 3. We perform the channel attention operation to obtain the channel attention weight $W_{CA}$, whose dimension is C × 1 × 1. This process contains two fully connected layers and a batch normalization operation, which can be represented as

$$W_{CA} = BN\big(FL(FL(H_{RCAB}(F_{i-1} + F_0)))\big),$$

where FL represents a fully connected layer and BN represents the batch normalization operation. We also perform the spatial attention operation to generate the spatial attention weight $W_{SA}$, whose dimension is 1 × H × W. This process contains two convolutions and two dilated convolutions, which can be represented as

$$W_{SA} = Conv\big(DConv(DConv(Conv(H_{RCAB}(F_{i-1} + F_0))))\big),$$

where Conv represents a convolution and DConv a dilated convolution. We replicate the channel attention weight $W_{CA}$ HW times and rearrange the copies to obtain a new channel attention weight of dimension C × H × W. At the same time, we replicate the spatial attention weight $W_{SA}$ C times and stack the copies to obtain a new spatial attention weight of dimension C × H × W. Then we apply element-wise summation to fuse [32] the two new attention weights and obtain $W_{IA}$. Moreover, we perform element-wise multiplication between the feature maps $H_{RCAB}(F_{i-1} + F_0)$ and the correlation weight matrix $W_{IA}$. At last, we add the feature map $H_{RCAB}(F_{i-1} + F_0)$ to the output of the element-wise multiplication to obtain the weighted features:

$$F_i = H_{RCAB}(F_{i-1} + F_0) + W_{IA} \odot H_{RCAB}(F_{i-1} + F_0),$$

where $\odot$ denotes element-wise multiplication and $F_i$ is the weighted output of the integrated attention module. Unlike traditional channel attention and spatial attention modules, our IAM selectively learns inter-channel and intra-channel features by constructing independently adjusted channel and spatial weight correlations.
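A minimal PyTorch sketch of the IAM is given below. Several details are assumptions filled in the spirit of BAM [25] rather than specifics stated above: the global average pooling in the channel branch (needed to reach the C × 1 × 1 shape), the reduction ratio r, the dilation rate, and the sigmoid applied to the fused weights.

```python
import torch
import torch.nn as nn

class IAM(nn.Module):
    """A minimal sketch of the integrated attention module.

    Channel branch: two FC layers + BN (global average pooling is assumed so
    that W_CA has shape C x 1 x 1). Spatial branch: two convolutions and two
    dilated convolutions producing a 1 x H x W map. The reduction ratio r,
    dilation rate, and final sigmoid are assumptions.
    """
    def __init__(self, channels, r=16, dilation=4):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.BatchNorm1d(channels))
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1),
            nn.Conv2d(channels // r, channels // r, 3,
                      padding=dilation, dilation=dilation),
            nn.Conv2d(channels // r, channels // r, 3,
                      padding=dilation, dilation=dilation),
            nn.Conv2d(channels // r, 1, 1))

    def forward(self, x):                          # x: (N, C, H, W)
        w_ca = self.channel(x)[:, :, None, None]   # (N, C, 1, 1), broadcast to H, W
        w_sa = self.spatial(x)                     # (N, 1, H, W), broadcast to C
        w_ia = torch.sigmoid(w_ca + w_sa)          # element-wise fusion -> W_IA
        return x + x * w_ia                        # residual reweighting -> F_i
```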

The proposed algorithm of MPAN
Fig 4 provides the pseudocode of MPAN in a PyTorch-like style. This algorithm includes three parts: the main part, the CAM function, and the IRAG function. The main part provides the code for training one batch of data, including forward propagation, loss computation [33], and backpropagation. The CAM function provides the code for constructing and applying the concatenated attention. The IRAG function provides the code for the RCAB modules and for constructing and applying the integrated attention.
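Since Fig 4 itself is not reproduced here, the following PyTorch sketch shows the general shape of the per-batch main part; the L1 reconstruction loss and the optimizer choice are common super-resolution conventions assumed for illustration, not necessarily the exact loss of [33].

```python
import torch
import torch.nn.functional as F

# A minimal sketch of one training batch: forward propagation, loss
# computation, and backpropagation. The L1 loss is an assumption in line
# with common SR practice.
def train_step(model, optimizer, lr_batch, hr_batch):
    optimizer.zero_grad()
    sr_batch = model(lr_batch)             # forward propagation
    loss = F.l1_loss(sr_batch, hr_batch)   # loss computation
    loss.backward()                        # backpropagation
    optimizer.step()
    return loss.item()
```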

Experiments
This section begins by analyzing the contributions of the two proposed attention modules. Then we compare our MPAN to the most advanced algorithms on our test dataset. Results on more images are provided in the additional materials.

Settings
Datasets. FFHQ [34] is a high-quality human face dataset. It is composed of 70,000 high-quality face images downloaded from the internet at 1024x1024 resolution and is used for various human face tasks such as face detection and super-resolution. We create our dataset ourselves: we choose 22,000 photos of individuals of varying ages and sexes from the FFHQ dataset as the training set, and an additional 2,000 images as the testing set. The face images in the training set and testing set are entirely distinct [35]. We resize the images to 128x128 with a bicubic interpolation operation [36] to obtain the ground-truth HR images. We transform the restored RGB result into YCbCr space, and the final results are assessed with metrics such as PSNR and SSIM [37] on the luminance Y channel [38]. Implementation details. We use the PyTorch [39] platform to implement the multi-phase attention network. In our experiment, the patch size is set as 96.
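For reference, evaluating PSNR on the luminance channel can be sketched as follows. The BT.601 RGB-to-Y conversion coefficients are the standard ones; any border cropping applied before evaluation is omitted here for brevity.

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luminance of a uint8 RGB image of shape (H, W, 3)."""
    img = img.astype(np.float64)
    return (65.481 * img[..., 0] + 128.553 * img[..., 1]
            + 24.966 * img[..., 2]) / 255.0 + 16.0

def psnr_y(sr, hr):
    """PSNR between the Y channels of the SR result and the HR ground truth."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```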

Ablation study regarding the proposed CAM and IAM
The proposed CAM and IAM generate weight correlation matrices over different layers, channels, and spatial positions. To measure the real effects of the two modules, we conduct experiments without CAM and IAM on our face dataset with a scale factor of 4x and make comparisons. Table 1 presents the quantitative results. The model with CAM achieves 0.06 dB higher PSNR than the baseline model without the CAM and IAM modules, whereas the model with IAM gains 0.16 dB. The model using both CAM and IAM modules performs best on PSNR at scale factor X4. Fig 5 illustrates the clear improvement in face structures and details when both CAM and IAM modules are used.

Ablation study regarding the number of integrated residual attention group
In this paper, the integrated residual attention group (IRAG), the main part of the network, is composed of residual channel attention blocks (RCAB) and the IAM. The IAM integrates channel attention and spatial attention to promote the network's representation ability. We study the effect of the number of IRAGs in MPAN. Specifically, we set the IRAG number to one, three, five, and ten. The PSNR and SSIM results at scale factor X4 are shown in Table 2. The evaluation metrics clearly improve as the number of IRAGs increases. This ablation study demonstrates the significant impact of the IRAG modules, so we choose ten IRAGs, which obtain the best scores, for our network.

Ablation study regarding the number of residual channel attention block
We conduct an ablation study to determine the optimal number of residual channel attention blocks (RCAB) in the proposed integrated residual attention group (IRAG) module. Specifically, we apply five, ten, fifteen, and twenty RCABs in each IRAG module and evaluate our network on the testing dataset. As indicated in Table 3, we compare these four types of models at the X4 scale factor. Clearly, as the number of RCABs increases, the PSNR and SSIM values on the test data increase as well. This ablation study illustrates the effectiveness of RCAB, so we choose twenty RCABs, which give the best performance, for our network.

Comparisons with state-of-the-art methods
We compare our proposed network with state-of-the-art methods, including SR methods such as EDSR [15], RDN [17], RCAN [5], SPARNet [44], and SAN [27]. We utilize the open-source code of the models above and train them on the same dataset.
Overall results. Quantitative comparisons at X2, X3, and X4 on our face image dataset are shown in Table 4. In terms of PSNR and SSIM scores, our MPAN performs better than the other state-of-the-art approaches. With the two proposed attention modules, our MPAN recovers detailed face structures better. As shown in Fig 6, most other state-of-the-art methods are unable to recover the nose and eyes accurately and suffer from blurred details, but MPAN obtains sharper results that are close to the ground-truth HR images.
Detailed comparisons.
a. Attention mechanisms. RCAN, SPARNet, SAN, and MPAN are the compared methods that involve attention mechanisms. RCAN is designed for, and widely used in, various SR tasks. It comprises a channel attention mechanism that adaptively rescales channel-wise features by assigning distinct channel weights. Channel attention has been proven effective for conventional SR tasks, but spatial attention is substantially more advantageous for low-resolution face super-resolution. This is why the other three methods, which contain spatial attention mechanisms, perform better than RCAN, as shown in Table 4.
b. Compared with RDN. RDN makes full use of hierarchical features from layers of different depths by using the residual dense block, which includes dense connections from lower to higher layers. RDN does not use any attention mechanism. In contrast, our MPAN uses not only the layer, channel, and spatial attention mechanisms but also skip connections. Our MPAN outperforms RDN on the evaluation metrics and in recovering key face structures, as shown in Table 4 and Fig 7, demonstrating the superiority of the attention mechanism.
c. Compared with SPARNet. SPARNet is composed of stacked Face Attention Units (FAUs), which consist of an attention branch and a feature branch. The attention branch utilizes the spatial attention mechanism to focus on feature-rich face regions. In contrast, our MPAN integrates the channel and spatial attention mechanisms into the integrated attention mechanism, making the network pay more attention to the key face structures. Our MPAN also rescales the features of different layers to form the concatenated attention mechanism, allowing the network to reconstruct facial texture information. So our MPAN performs better than SPARNet, as shown in Table 4 and Fig 8.
d. Compared with SAN. SAN uses both the channel and spatial attention mechanisms. It places the spatial attention module at the front and end of the network and places the channel attention module in every part of the network backbone. Our MPAN also uses layer, channel, and spatial attention mechanisms, but the specific implementation and usage are different. Our MPAN places the channel and spatial attention mechanisms in each part of the network and uses the layer attention mechanism to allocate different weights to layers, which is beneficial to restoring texture details. Thus, our MPAN clearly outperforms SAN on all upscale factors, as shown in Table 4, and recovers more face details than SAN, as shown in Fig 9.
e. Compared with different upscale factors. Our models are trained with X2, X3, and X4 upscale factors. The X2 model was trained from scratch, while the X3 and X4 models were initialized with the pre-trained X2 model. We compare the SR results of different upscale factors in Table 4. MPAN achieves the most significant outcomes for all upscale factors, particularly X2. Among all upscale factors, all models obtain the best results for the X2 upscale factor and the worst for X4, so the models can reconstruct face images better when given enough face features.

Conclusions
In this paper, we build the Multi-phase Attention Network (MPAN) for face super-resolution, which rescales features across different layers, channels, and positions. Specifically, the integrated residual attention groups (IRAG) form the basic blocks of MPAN, the concatenated attention module (CAM) reallocates dependencies among layers of different depths, and the integrated attention module (IAM) fuses the features of channels and positions. These two attention modules form the multi-phase attention mechanism, which further enhances the performance of recovering HR face images. Extensive experiments demonstrate that our MPAN performs better than other state-of-the-art methods in terms of PSNR and SSIM, making the restored SR face images closer to the real HR face images.